October 7, 2015
rnorm(100)
No seed
rnorm(5); rnorm(5)
## [1] 0.7101779 -1.1056809 0.1519724 0.7825387 0.2922975
## [1] -2.4149499 -0.8790479 -1.6497132 -0.8743159 0.5449202
Seed
set.seed(1); rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
set.seed(1); rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
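The effect of the seed can be checked directly by comparing draws. This is a minimal sketch; the names `a`, `b`, and `c_draw` are illustrative and not part of the original slides.

```r
# Two draws made after resetting the same seed match exactly
set.seed(1); a <- rnorm(5)
set.seed(1); b <- rnorm(5)
identical(a, b)        # TRUE

# Drawing again without resetting the seed continues the random stream,
# so the values differ from the first draw
c_draw <- rnorm(5)
identical(a, c_draw)   # FALSE
```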
rbinom, rnorm, rpois…
mvrnorm from the MASS package
sample
sample_n from the dplyr package
sample_frac from the dplyr package
http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/
rnbinom
http://bioinformatics.oxfordjournals.org/content/early/2015/04/28/bioinformatics.btv272.abstract
http://bioinformatics.oxfordjournals.org/content/early/2015/02/26/bioinformatics.btv124
http://biostatistics.oxfordjournals.org/content/early/2014/01/06/biostatistics.kxt053.full
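As a rough illustration of two of the base R functions listed above, the sketch below simulates count data with `rnbinom` and resamples it with `sample`. The parameter values (`mu = 10`, `size = 2`, 1000 draws) are arbitrary choices for the example, not values taken from the linked papers.

```r
set.seed(1)

# Negative binomial counts parameterised by mean (mu) and dispersion (size);
# under this parameterisation the variance is mu + mu^2 / size
counts <- rnbinom(1000, size = 2, mu = 10)
mean(counts)
var(counts)

# sample() with no extra arguments returns a random permutation
perm <- sample(counts)

# sample() with replace = TRUE draws a bootstrap resample of the same length
boot <- sample(counts, replace = TRUE)
```

The sample mean and variance should land near 10 and 60 respectively, though the exact values depend on the seed.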
P-values, although not essential for doing tests, are nearly ubiquitous in applied work. They are a very useful "shorthand" and you should understand them.
Under the convention that larger test statistics occur farther from \(H_0\), for observed data \(Y\) we define: \[p = p(Y) = Pr_{Y'\sim F|H_0}[T(Y') > T(Y)]\] i.e. the long-run proportion of datasets, under the null, which are "more extreme" than the observed \(Y\).
What's the distribution of a \(p\)-value? Under the null:
\[Pr_{Y \sim F|H_0}[p(Y) \leq \alpha] = Pr_{Y \sim F|H_0}\left[Pr_{Y' \sim F|H_0}[T(Y') > T(Y)] \leq \alpha \right]\] \[= Pr_{Y \sim F|H_0}[ 1 - \mathcal{F}(T(Y)) \leq \alpha]\] \[= Pr_{Y \sim F|H_0}[T(Y) \geq \mathcal{F}^{-1}(1-\alpha)]\] \[= 1 - \mathcal{F}(\mathcal{F}^{-1}(1-\alpha)) = \alpha\]
where \(\mathcal{F}(\cdot)\) denotes the cumulative distribution function of \(T(Y)\) under the null, assumed continuous and strictly increasing so that \(\mathcal{F}^{-1}\) exists. In other words, under the null the \(p\)-value is uniformly distributed on \([0, 1]\).
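The result above can be checked by simulation. The sketch below is one possible check, assuming a two-sample t-test on normal data as the test; the number of replicates (`n_sim`), group size (20), and `alpha` are arbitrary choices for illustration.

```r
set.seed(1)
n_sim <- 2000
alpha <- 0.05

# Each replicate draws both groups from the same normal distribution,
# so the null hypothesis holds by construction
p_vals <- replicate(n_sim, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of p-values at or below alpha; this should be close to alpha
mean(p_vals <= alpha)

# hist(p_vals) should look roughly flat on [0, 1]
```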
There is tremendous confusion over \(p\)-values. The following hold very generally
Suppose we observe survival times on 16 mice in a treatment and a control group: \[X = (94, 197, 16, 38, 99, 141, 23)\] \[Y = (52, 104, 146, 10, 51, 30, 40, 27, 46)\] The treatment values \(X \sim F\) and the control values \(Y \sim G\), and we want to test \(H_0: F = G\). The difference of means is \(\hat{\theta} = \bar{x} - \bar{y} = 30.63\). One way to calculate a p-value would be to make a parametric assumption.
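As one example of the parametric route, the sketch below applies a two-sample t-test to the mouse data. The choice of Welch's t-test (R's default for `t.test`) is an assumption made for illustration; the slides do not say which parametric test is intended.

```r
x <- c(94, 197, 16, 38, 99, 141, 23)           # treatment group
y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)   # control group

# Observed difference of means
theta_hat <- mean(x) - mean(y)
round(theta_hat, 2)   # 30.63

# Welch two-sample t-test: assumes approximately normal data,
# allows the two groups to have unequal variances
fit <- t.test(x, y)
fit$p.value
```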
Another clever way, devised by Fisher, is permutation. Define the vector \(g\) to be the group labels: \(g_i = 1\) if treatment (X) and \(g_i = 0\) if control (Y). Then pool all of the observations together into a vector \(v = (94, 197, 16, 38, 99, 141, 23, 52, 104, 146, 10, 51, 30, 40, 27, 46)\).
If there are \(n\) samples from \(F\) and \(m\) samples from \(G\) for a total of \(N = m + n\) samples, then there are \(\binom{N}{n}\) possible ways to assign the group labels.
Under \(F = G\) it is possible to show that, conditional on the observed values, each of these is equally likely. So we can write our statistic as \[\hat{\theta} = \bar{x} - \bar{y} = \frac{1}{n} \sum_{g_i = 1} v_i - \frac{1}{m}\sum_{g_i=0} v_i\]
The permutation null distribution of \(\hat{\theta}\) is calculated by forming all permutations of \(g\) and recalculating the statistic.
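For the mouse data the full enumeration is small enough to carry out directly. The following is a minimal sketch using `combn` to list every labelling; the two-sided comparison on absolute values and the variable names (`idx`, `theta_perm`, `p_exact`) are choices made for this example.

```r
x <- c(94, 197, 16, 38, 99, 141, 23)           # treatment group
y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)   # control group
v <- c(x, y)                                   # pooled observations
n <- length(x); m <- length(y); N <- n + m

# Sums of integers are exact, so using sum()/n keeps ties exact as well
theta_hat <- sum(x) / n - sum(y) / m

# Every way of choosing which n of the N pooled values are labelled treatment;
# each column of idx is one labelling
idx <- combn(N, n)
ncol(idx)   # 11440, i.e. choose(16, 7)

# Recompute the statistic under each labelling
theta_perm <- apply(idx, 2, function(i) sum(v[i]) / n - sum(v[-i]) / m)

# Two-sided exact permutation p-value: share of labellings at least as extreme
p_exact <- mean(abs(theta_perm) >= abs(theta_hat))
p_exact
```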
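When enumerating every labelling is impractical, draw \(B\) random permutations of \(g\), compute the statistic \(\hat{\theta}^{b}\) for each \(b = 1, \dots, B\), and estimate the p-value by \[\hat{p}_{perm} = \left(\#\{|\hat{\theta}^{b}| \geq |\hat{\theta}| \} + 1\right) /(B+1)\]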
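The estimator above can be sketched in a few lines of R for the mouse data. The number of random permutations (`B = 10000`) and the seed are arbitrary choices for illustration.

```r
x <- c(94, 197, 16, 38, 99, 141, 23)           # treatment group
y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)   # control group
v <- c(x, y)                                   # pooled observations
n <- length(x); m <- length(y)

# Observed statistic
theta_hat <- sum(x) / n - sum(y) / m

set.seed(1)
B <- 10000

# For each replicate, pick a random set of n positions to label as treatment
# and recompute the difference of means
theta_b <- replicate(B, {
  i <- sample(length(v), n)
  sum(v[i]) / n - sum(v[-i]) / m
})

# Monte Carlo permutation p-value; the +1 terms count the observed labelling
p_perm <- (sum(abs(theta_b) >= abs(theta_hat)) + 1) / (B + 1)
p_perm
```

Adding one to the numerator and denominator keeps the estimate strictly positive, so a finite number of permutations never reports a p-value of exactly zero.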